(57) Abstract

A real-time pipeline processor, which is particularly suited for VLSI implementation, is based on a hardware oriented radix-2^2
algorithm derived by integrating a twiddle factor decomposition technique in a divide and conquer approach. The radix-2^2 algorithm has the
same multiplicative complexity as a radix-4 algorithm, but retains the butterfly structure of a radix-2 algorithm. A single-path delay-feedback
architecture is used in order to exploit the spatial regularity in the signal flow graph of the algorithm. For a length-N DFT transform, the
hardware requirements of the processor proposed by the present invention are minimal on both dominant components: log_4 N - 1 complex
multipliers, and N - 1 complex data memory.
Improvements in or Relating to Real-Time Pipeline
Fast Fourier Transform Processors



The present invention relates to real-time pipeline
fast Fourier transform processors and, in particular,
such processors based on a radix-2^2 algorithm.

Pipeline digital Fourier transform processors are
a specified class of processors used to perform DFT
computations. A real-time pipeline processor is a
processor whose processing speed matches the input data
rate, i.e. the data acquisition speed for continuous
operation. For an FFT processor, this means that a
length 'N' DFT must be computed in 'N' clock cycles,
since the data acquisition speed is one sample per
cycle. Pipeline operation enables a partial result,
obtained from a preceding stage of the processor, to be
immediately used in a following stage, without delay.

FFT processors find application, inter alia, in
digital mobile cellular radio systems where there exist
considerable constraints on power consumption and chip
size. The primary constraining factor may, therefore,
be chip complexity, in terms of the number of adders,
the number of multipliers, data storage requirements and
control complexity, rather than speed of operation.

The present invention emerges from a new approach
to the design of real-time pipeline FFT processors. The
architecture of a real-time FFT processor, according to
the present invention, can be described as a radix-2^2
single-path delay feedback, or R2^2SDF, architecture.
Such a processor can operate on the basis of a hardware
oriented radix-2^2 algorithm, developed by integrating a
twiddle factor decomposition technique in a divide and
conquer approach to form a spatially regular signal flow
graph. In a divide and conquer technique the
computation of a DFT is decomposed into nested DFTs of
shorter length. Divide and conquer techniques are well
known in the derivation of fast algorithms and, in the
case of the present invention, refer to approaches in
which an N-point DFT is decomposed into successively
smaller DFTs which are then computed separately and
combined to give the final result. The twiddle factor
refers to the intervening phase shift, or rotational factor.
In the present invention, two stages of radix-2
decomposition are performed together and re-decomposed,
so that the first stage has only trivial factors which
do not require multiplication. However, it should be
noted that the two steps are not computed
simultaneously.



The algorithm used in the present invention is
referred to as a radix-2^2 algorithm because it has the
same multiplicative complexity as a radix-4 algorithm
but requires radix-2 butterflies in its signal flow
graph. The architecture of the processor is described
as a single-path delay feedback because only a single
data path exists between butterfly stages and each
butterfly uses a FIFO buffer in the feedback loop. The
signal flow graph is described as spatially regular,
because only every alternate column in the SFG has a
non-trivial multiplicative operation. This contrasts
with an ordinary radix-2 SFG in which there is a
non-trivial multiplication in every column in the SFG.

A pipeline DFT processor is characterised by
real-time continuous processing of the data sequence passed
to the processor. The time complexity of the processor
is N and, therefore, it is an AT^2 non-optimal approach
with AT^2 = O(N^3), since the area lower bound is O(N).
However, in ["Fourier transform in VLSI" - C. D.
Thompson, IEEE Trans. Comput. C-32(11):1047-1057, Nov.
1983], it has been suggested that for real-time
processing a new metric should be introduced, since it
is necessarily non-optimal given the time complexity of
O(N). Although asymptotically almost all the feasible
approaches have reached the area lower bound, [see S. He
and M. Torkelson "A new expandable 2D systolic array for
DFT computation based on symbiosis of 1D arrays" Proc.
ICA3PP'95, pages 12-19, Brisbane, Australia, Apr 1995], one
particular class of pipeline processors with the
application of recursive Common Factor Algorithm,
(collectively known as Fast Fourier Transform), [see C.
S. Burrus "Index mapping for multidimensional
formulation of the DFT and convolution" - IEEE Trans.
Acoust., Speech, Signal Processing, ASSP-25(3):239-242,
June 1977], has probably the smallest "constant factor"
among the approaches that meet the time requirement, due
to the least number, O(log N), of processing elements.
The difference comes from the fact that an arithmetic
unit, especially the multiplier, takes up a much larger
area than a digital register in digital VLSI design.

It should be noted that at least Ω(log N) PEs, with
multipliers, are needed to meet the real-time processing
requirements due to the multiplicative computational
complexity of Ω(N log N) for FFT algorithms. Thus, this
is in the nature of a "lower bound" for multiplier
requirements. Any optimal architecture for real-time
processing will probably have Ω(log N) multipliers.
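
One way to read this counting argument, as a sketch: assume each multiplier completes at most one complex multiplication per clock cycle (an assumption made here for illustration) and that one N-point frame must be finished every N cycles. Writing M for the number of multipliers (a symbol introduced here), the work available per frame must cover the work required:

$$ M \cdot N \;\ge\; \Omega(N \log N) \quad\Longrightarrow\quad M \;\ge\; \Omega(\log N) $$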

Another major chip area and energy consumer, for an
FFT processor, is the memory requirement for buffering
the input data and intermediate results for the
computation. For large transforms, this is a dominant
factor, [see E. E. Swartzlander et al. "A radix-4 delay
commutator for fast Fourier transform processor
implementation" IEEE J. Solid-State Circuits,
SC-19(5):702-709 Oct 1984; and E. Bidet et al. "A fast
single-chip implementation of 8192 complex point FFT"
IEEE J. Solid-State Circuits, 30(3):300-305 Mar 1995].
Although there is no formal proof, the area lower bound
indicates that the lower bound for the number of
registers is likely to be Ω(N). This is obviously true
for any architecture implementing an FFT based
algorithm, since the butterfly at the first stage has to
take data elements separated by N/r from the input
sequence, where r is a small constant integer, or the
radix.



Combining the above arguments suggests a pipeline
FFT processor with Ω(log_r N) PEs, or multipliers, and Ω(N)
complex word registers. The optimal architecture has to
be one that reduces the constant factor, or the absolute
number of arithmetic units (multipliers and adders) and
memory size, to the minimum.

Some of the known architectures for pipeline
processors will now be considered. In order to prevent
the comparison between different architectures being
affected by sequence order, it will be assumed that the
real-time processing task only requires the input time
sequence to be in normal order and that the output can
be in digit reversed (radix-2, or radix-4) order. This
is permissible in such applications as DFT based
communication systems, [see M. Alard and R. Lassalle
"Principles of modulation and channel coding for digital
broadcasting and mobile receivers" EBU Review (224):47-69
Aug 1987]. DIF type decompositions are used
throughout.

The architecture design for pipeline FFT processors
has been the subject of intensive research since the
'70s, when a need for real-time processing was
identified in such applications as radar signal
processing, [see L. R. Rabiner and B. Gold "Theory and
Application of Digital Signal Processing" -
Prentice-Hall 1975], well before VLSI technology had
advanced to the level of system integration. Several
architectures have been proposed over the last two decades.
These architectures are briefly reviewed below using a unified
terminology and functional block diagrams in which the
additive butterfly is separated from the multiplier to
clearly indicate the hardware requirements. The control
and twiddle factor reading mechanism has been omitted
for clarity. All data and arithmetic operations are
complex and N is a power of 4.

A R2MDC processor is illustrated in Figure 1 of the
accompanying drawings, [see L. R. Rabiner and B. Gold
"Theory and Application of Digital Signal Processing" -
Prentice-Hall 1975]. This is a radix-2 Multi-path Delay
Commutator architecture which is probably the most
straightforward approach to pipeline implementation of
a radix-2 FFT algorithm. The input sequence is broken
into two parallel data streams flowing forward, with the
correct distance between data elements entering the
butterfly, scheduled by proper delays. Both butterflies
and multipliers are 50% utilised. The processor uses
log_2 N - 2 multipliers, log_2 N radix-2 butterflies and
3N/2 - 2 registers (delay elements).

A R2SDF processor is illustrated in Figure 2 of the
accompanying drawings, [see E. H. Wold and A. M. Despain
"Pipeline and parallel-pipeline FFT processors for VLSI
implementation" IEEE Trans. Comput. C-33(5):414-426 May
1984]. This processor uses a radix-2 single path
delay-feedback architecture, in which a more efficient use is
made of registers, by storing the intermediate
butterfly outputs in feedback shift registers. A single
data stream passes through the multipliers at every
stage. This architecture employs the same number of
butterfly units and multipliers as the R2MDC
architecture, but has a much reduced memory requirement,
namely N - 1 registers. In fact the memory requirement
can be described as minimal.



A R4SDF processor is illustrated in Figure 3 of the
accompanying drawings, [see A. M. Despain "Fourier
transform computer using CORDIC iterations" IEEE Trans.
Comput. C-23(10):993-1001 Oct 1974]. This processor uses
a radix-4 single path delay feedback architecture and is
a radix-4 version of the R2SDF architecture employing
CORDIC (Coordinate Rotation Digital Computer)
iterations. The multiplier utilisation is increased to
75% because of intermediate storage of 3 out of the 4
radix-4 butterfly outputs. However, the utilisation of
the radix-4 butterfly, which is fairly complicated and
requires at least 8 complex adders to implement, falls
to 25%, [see J. G. Proakis and D. G. Manolakis
"Introduction to Signal Processing" Macmillan 1989]. A
processor implemented in this architecture requires
log_4 N - 1 multipliers, log_4 N full radix-4 butterflies and
storage of size N - 1.

A R4MDC processor is illustrated in Figure 4 of the
accompanying drawings, [see L. R. Rabiner and B. Gold
"Theory and Application of Digital Signal Processing" -
Prentice-Hall 1975]. This processor uses a radix-4
multi-path delay commutator architecture and is a
radix-4 version of the R2MDC architecture. It was used as the
architecture for the initial implementation of pipeline
FFT processors, [see E. E. Swartzlander et al. "A radix
4 delay commutator for fast Fourier transform processor
implementation" IEEE J. Solid-State Circuits,
SC-19(5):702-709 Oct 1984], and massive wafer scale
integration, [see E. Swartzlander et al. "A radix 8
wafer scale FFT processor" J. VLSI Signal Processing
4(2,3):165-176 May 1992]. However, it suffers from a
low, 25%, utilisation of all components. This can only
be compensated for in some special applications where
four FFTs are being processed simultaneously.
Implementation of this processor requires 3 log_4 N
multipliers, log_4 N full radix-4 butterflies and
5N/2 - 4 registers.

A R4SDC processor is illustrated in Figure 5 of the
accompanying drawings, [see G. Bi and E. V. Jones "A
pipelined FFT processor for word-sequential data" IEEE
Trans. Acoust., Speech, Signal Processing,
37(12):1982-1985 Dec 1989]. This processor uses a radix-4
single-path delay commutator architecture together with a
modified radix-4 algorithm with programmable 1/4 radix-4
butterflies to achieve a higher, 75%, utilisation of
multipliers. A combined delay-commutator also reduces
the memory requirements, in comparison with the R4MDC
architecture, to 2N - 2, from 5N/2 - 4. The butterfly and
delay commutator become relatively complicated because
of the programmability requirements. The R4SDC
architecture has found application in building large
single chip pipeline FFT processors for HDTV.

A comparison of the processors described above
reveals the distinctive advantages and disadvantages of
the different architectures. The delay feedback
architectures are always more efficient than the
corresponding delay-commutator architectures in terms of
memory utilisation, because the stored butterfly outputs
can be directly used by the multipliers. Radix-4
algorithm based single-path architectures have a higher
multiplier utilisation. However, radix-2 architectures
have simpler butterflies which are more efficiently
utilised. The present invention is based, at least in
part, on these observations.

The present invention is a real-time pipeline
processor which is particularly suited for VLSI
implementation. The processor is based on a hardware
oriented radix-2^2 algorithm derived by integrating a
twiddle factor decomposition technique in a divide and
conquer approach. The radix-2^2 algorithm has the same
multiplicative complexity as a radix-4 algorithm, but
retains the butterfly structure of a radix-2 algorithm.
A single-path delay-feedback architecture is used in
order to exploit the spatial regularity in the signal
flow graph of the algorithm. For a length-N DFT
transform, the hardware requirements of the processor
proposed by the present invention are minimal on both
dominant components: log_4 N - 1 complex multipliers, and
N - 1 complex data memory.



According to a first aspect of the present
invention, there is provided a real-time pipeline fast
Fourier transform processor, characterised in that said
processor includes a plurality of paired first and
second butterfly means, each of said first butterfly
means and each of said second butterfly means having a
feedback path between an output therefrom to an input
thereto, in that each of said paired butterfly means is
linked by a multiplier to an adjacent one of said
plurality of paired first and second butterfly means, in
that an input data sequence is applied to an input of a
first one of said plurality of paired first and second
butterfly means, and in that an output data sequence is
derived from a last one of said plurality of paired
first and second butterfly means.

Preferably, said processor is realised on a VLSI
chip.

Preferably, said processor operates on a radix-2^2
algorithm having the same multiplicative complexity as
a radix-4 algorithm but employing radix-2 butterflies.
Preferably, only a single data path exists between
each butterfly means.

Said first butterfly means may be radix-2 single
delay feedback butterflies, and said second butterfly
means may be radix-2 single delay feedback butterflies
including logic circuitry to implement trivial twiddle
factor multiplications.

Said processor may include a synchronisation
control means and an address means for twiddle factor
reading for each processor stage.

Said first butterfly means may include two adders,
two subtractors, and four 2-to-1 multiplexers.

Said second butterfly means may include at least
one adder, at least one subtractor, at least two 2-to-1
multiplexers, a 2x2 commutator, and an AND gate with one
inverting input and one non-inverting input.



Control signals derived from said synchronisation
control means may be applied to the inputs to said AND
gate.

Said synchronisation control means and said address 
means may be implemented as a single binary counter. 

A pipeline register may be located between each
multiplier and a following butterfly means.

Said processor may include shimming registers for
adjusting control signal timing.

Preferably, for a digital Fourier transform of
length N, said processor includes no more than log_4 N - 1
multipliers, 4 log_4 N adders, and a memory size of N - 1.
Said processor may be arranged to handle a 256
point FFT, and said processor may have four processing
stages in said pipeline, each processing stage separated
by a multiplier, each processing stage comprising a
first butterfly means with a feedback register, and a
second butterfly means with a feedback register.

Said first butterfly means in said first stage may
have a one hundred and twenty eight word feedback
register, said second butterfly means in said first
stage may have a sixty four word feedback register, said
first butterfly means in said second stage may have a
thirty two word feedback register, said second butterfly
means in said second stage may have a sixteen word
feedback register, said first butterfly means in said
third stage may have an eight word feedback register,
said second butterfly means in said third stage may have
a four word feedback register, said first butterfly
means in said fourth stage may have a two word feedback
register and said second butterfly means in said fourth
stage may have a one word feedback register.
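
The register sizes listed above halve from one butterfly to the next, which can be checked with a minimal sketch. The function name `feedback_register_sizes` is hypothetical, and the sketch assumes N is a power of 4, as stated earlier in the specification.

```python
def feedback_register_sizes(n):
    """Feedback register depth, in words, for each butterfly from first to last.

    Assumes n is a power of 4, so there are log2(n) butterflies in total
    (two per processing stage).
    """
    log2n = n.bit_length() - 1
    if n < 4 or (1 << log2n) != n or log2n % 2:
        raise ValueError("n must be a power of 4")
    # Butterfly i (0-based) works on data elements n / 2**(i + 1) apart.
    return [n >> (i + 1) for i in range(log2n)]


sizes = feedback_register_sizes(256)
print(sizes)
print(sum(sizes))
```

For N = 256 the depths add up to N - 1 words, which matches the memory size given for the processor.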

According to a second aspect of the present
invention, there is provided a real-time pipeline fast
Fourier transform processor, characterised in that said
processor operates on a radix-2^2 algorithm having the
same multiplicative complexity as a radix-4 algorithm
but employing radix-2 butterflies, and in that for a
digital Fourier transform of length N, said processor
includes no more than log_4 N - 1 multipliers, 4 log_4 N
adders, and a memory size of N - 1.

Embodiments of the invention will now be described,
by way of example, with reference to the accompanying
drawings, in which:

Figure 1 illustrates a known architecture for a
pipeline FFT processor designated as R2MDC.

Figure 2 illustrates a known architecture for a
pipeline FFT processor designated as R2SDF.

Figure 3 illustrates a known architecture for a
pipeline FFT processor designated as R4SDF.

Figure 4 illustrates a known architecture for a
pipeline FFT processor designated as R4MDC.

Figure 5 illustrates a known architecture for a
pipeline FFT processor designated as R4SDC.

Figure 6 illustrates a radix-2 butterfly structure,
for the present invention, obtained by twiddle
factor decomposition.

Figure 7 is a radix-2^2 DIF FFT flow graph for N =
16.

Figure 8 is a radix-2^2 DIF FFT flow graph for N =
64.

Figure 9 illustrates a R2^2SDF pipeline FFT
architecture for N = 256, according to the present
invention.

Figure 10 illustrates a first butterfly structure
used in a R2^2SDF pipeline FFT processor according to
the invention.

Figure 11 illustrates a second butterfly structure
used in a R2^2SDF pipeline FFT processor according to
the invention.

To facilitate an understanding of the present
invention a glossary of the abbreviations used in the
specification is set out below:

<.>_N:   denotes a residue modulo-N operation, e.g.
         <7>_3 = 1 and <7>_4 = 3

Ω(.):    lower bound in asymptotic analysis

AT^2:    reference to area-time complexity

CFA:     Common Factor Algorithm

DFT:     Digital, or Discrete, Fourier Transform

DIF:     Decimation In Frequency (algorithm) - where a
         fast algorithm is derived using a divide and
         conquer approach, if the first step is to
         divide the input data sequence into a first
         and second half, equivalent to separating the
         frequency points by even and odd number, the
         algorithm is described as a DIF algorithm - in
         the SFG the frequency points will be in
         bit-reversed order

DIT:     Decimation In Time (algorithm) - where a fast
         algorithm is derived using a divide and
         conquer approach, if the first step is to
         divide the input data sequence into two,
         according to its even and odd numbered points,
         the algorithm is described as a DIT algorithm
         - in the SFG the input will be in bit-reversed
         order

FIFO:    First In First Out

R2^2SDF: Radix-2^2 Single-path Delay Feedback

FFT:     Fast Fourier Transform

Metric:  the Hamming distance between two code words -
         enables a determination of whether, or not, an
         architecture is optimal to be made

O(.):    upper bound in asymptotic analysis

O(N^2):  means that the growth rate, for N sufficiently
         large, is no greater than N^2

PE:      Processing Element

SDF:     Single-path Delay Feedback

SFG:     Signal Flow Graph

VLSI:    Very Large Scale Integration

From the observations made in the introduction to
this patent specification, it can be seen that a
comparison of different architectures for pipeline FFT
processors shows that the most desirable hardware
oriented algorithm will have the same number of
non-trivial multiplications, at the same position in the
flowgraph, as a radix-4 algorithm, but will retain the
butterfly structure of a radix-2 algorithm. This
feature appears in a number of known algorithms. A SFG
has been obtained, within a complex "bias" factor, as a
result of a constant-rotation/compensation procedure
using restricted CORDIC operations, [see A. M. Despain
"Very fast Fourier transform algorithms hardware for
implementation" IEEE Trans. Comput. C-28(5):333-341 May
1979]. Another algorithm combining radix-4 and
radix-'4+2', in DIT form, has been used to reduce the scaling
noise in a R2MDC architecture, without altering the
multiplier requirement, [see R. Storn "Radix-2
FFT-pipeline architecture with reduced noise-to-signal
ratio" IEE Proc.-Vis. Image Signal Process. 141(2):81-86
Apr. 1994]. A clear derivation of an algorithm in DIF
form directed to the reduction of hardware requirements
in the context of pipeline FFT processors has not, until
now, been presented.



To avoid confusion with the well known radix-2/4
algorithm and the mixed radix-'4+2' FFT algorithm, [see
E. O. Brigham "The fast Fourier transform and its
applications" Prentice-Hall 1988], the algorithm on which
the present invention is based is referred to as the
radix-2^2 algorithm. This notation clearly reflects the
structural relationship of this algorithm to the
radix-2 algorithm and the identity between the computational
requirements of this algorithm and the radix-4
algorithm.

A DFT of size N is defined by:

$$ X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad 0 \le k < N \qquad (1) $$

where W_N denotes the Nth primitive root of unity, with
its exponent evaluated modulo N. The DFT coefficients
are "rotating factors", which are constant length
vectors in the complex plane with different phase, or
rotation angles. The constant-rotation/compensation
procedure proposed by Despain is based on the idea that
given a complex bias, all angles can be rotated by
successive rotations of a fixed, constant angle. This
bias can be compensated at the final stage of the
computation. To make the derivation of the new
algorithm clearer, consider the first 2 steps in a
radix-2 DIF FFT decomposition together. Applying a
3-dimensional linear map,
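$$ n = \left\langle \tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3 \right\rangle_N, \qquad k = \left\langle k_1 + 2k_2 + 4k_3 \right\rangle_N \qquad (2) $$

the CFA algorithm has the form of

$$
\begin{aligned}
X(k_1 + 2k_2 + 4k_3)
&= \sum_{n_3=0}^{N/4-1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{1}
   x\!\left(\tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3\right)
   W_N^{\left(\frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)} \\
&= \sum_{n_3=0}^{N/4-1} \sum_{n_2=0}^{1}
   \left\{ B_{N/2}^{k_1}\!\left(\tfrac{N}{4} n_2 + n_3\right)
   W_N^{\left(\frac{N}{4} n_2 + n_3\right) k_1} \right\}
   W_N^{\left(\frac{N}{4} n_2 + n_3\right)(2k_2 + 4k_3)}
\end{aligned}
\qquad (3)
$$

where the butterfly structure has the form

$$ B_{N/2}^{k_1}\!\left(\tfrac{N}{4} n_2 + n_3\right) = x\!\left(\tfrac{N}{4} n_2 + n_3\right) + (-1)^{k_1}\, x\!\left(\tfrac{N}{4} n_2 + n_3 + \tfrac{N}{2}\right) $$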



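If the expression within the braces is computed with a
butterfly structure before further decomposition, an
ordinary radix-2 DIF FFT will be obtained. The key
concept behind the new algorithm is to extend the second
stage of decomposition to the remaining DFT
coefficients, including the twiddle factor

$$ W_N^{\left(\frac{N}{4} n_2 + n_3\right) k_1} $$

to exploit the exceptional values in multiplication
before the butterfly is constructed. Decompose the
composite twiddle factor observing that:

$$
\begin{aligned}
W_N^{\left(\frac{N}{4} n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)}
&= W_N^{N n_2 k_3}\, W_N^{\frac{N}{4} n_2 (k_1 + 2k_2)}\, W_N^{n_3 (k_1 + 2k_2)}\, W_N^{4 n_3 k_3} \\
&= (-j)^{n_2 (k_1 + 2k_2)}\, W_N^{n_3 (k_1 + 2k_2)}\, W_N^{4 n_3 k_3}
\end{aligned}
\qquad (4)
$$

Substituting equation (4) into equation (3) and
expanding the summation with index n_2, yields, after
simplification, a set of four DFTs of length N/4:

$$ X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \left[ H(k_1, k_2, n_3)\, W_N^{n_3 (k_1 + 2k_2)} \right] W_{N/4}^{n_3 k_3} \qquad (5) $$

where

$$ H(k_1, k_2, n_3) = \left[ x(n_3) + (-1)^{k_1} x\!\left(n_3 + \tfrac{N}{2}\right) \right] + (-j)^{(k_1 + 2k_2)} \left[ x\!\left(n_3 + \tfrac{N}{4}\right) + (-1)^{k_1} x\!\left(n_3 + \tfrac{3N}{4}\right) \right] \qquad (6) $$
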

The expressions in [] on the right hand side of equation
(6) represent a first butterfly BF I, the entire
expression on the right hand side of equation (6)
represents a second butterfly BF II. Equation (6)
represents the first two stages of butterflies with only
trivial multiplication in the flow graph, as BF I and BF
II, in Figure 6. After these two stages, full
multipliers are required to compute the multiplications
by the decomposed twiddle factor:

$$ W_N^{n_3 (k_1 + 2k_2)} $$

in equation (5), as shown in Figure 6. It should be noted
that the order of the twiddle factors is different from
that of a radix-4 algorithm.
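
The first decomposition step can be checked numerically with the sketch below. It assumes the standard form of the radix-2^2 decomposition, in which BF I forms x(n3) ± x(n3 + N/2), BF II combines two BF I outputs using only a power of -j, and the result is multiplied by W_N^(n3(k1 + 2k2)) before an N/4-point DFT over n3. The function names `dft` and `radix22_first_step` are hypothetical.

```python
import cmath

ROT = (1, -1j, -1, 1j)  # (-j)**m for m = 0..3


def dft(x):
    """Direct DFT as in equation (1)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def radix22_first_step(x):
    """One radix-2^2 step: BF I, BF II, twiddle, then four N/4-point DFTs."""
    n = len(x)
    if n % 4:
        raise ValueError("length must be a multiple of 4")
    q = n // 4
    out = [0j] * n
    for k1 in (0, 1):
        for k2 in (0, 1):
            sign = -1 if k1 else 1          # (-1)**k1
            rot = ROT[(k1 + 2 * k2) % 4]    # (-j)**(k1 + 2*k2), trivial factor
            g = []
            for n3 in range(q):
                bf1_a = x[n3] + sign * x[n3 + 2 * q]       # BF I
                bf1_b = x[n3 + q] + sign * x[n3 + 3 * q]   # BF I
                h = bf1_a + rot * bf1_b                    # BF II
                twiddle = cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / n)
                g.append(h * twiddle)                      # full multiplier
            for k3, value in enumerate(dft(g)):            # N/4-point DFT over n3
                out[k1 + 2 * k2 + 4 * k3] = value
    return out


x = [complex(i, (3 * i) % 5) for i in range(16)]
error = max(abs(a - b) for a, b in zip(radix22_first_step(x), dft(x)))
print(error < 1e-9)
```

The twiddle exponent multiplier k1 + 2k2 takes the values 0, 2, 1, 3 as the four output blocks are visited in memory order, which illustrates why the order of the twiddle factors differs from that of a radix-4 algorithm.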

Applying this CFA procedure recursively to the
remaining DFTs of length N/4 in equation (5), yields the
complete radix-2^2 DIF FFT algorithm. An N = 16 example
is shown in Figure 7 and an N = 64 example is shown in
Figure 8. In Figure 8 the small diamonds represent
trivial multiplication by:

$$ W_N^{N/4} = -j $$

which involves only real-imaginary swapping and sign
inversion.
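
As a small illustration of why this multiplication is trivial, the sketch below multiplies a complex value by -j using only a swap and a negation. The function name `mul_minus_j` is hypothetical.

```python
def mul_minus_j(re, im):
    """Return (re + j*im) * (-j) as a (real, imaginary) pair.

    (re + j*im) * (-j) = im - j*re, so the parts are swapped and the
    new imaginary part is negated; no multiplier is needed.
    """
    return im, -re


print(mul_minus_j(3.0, 4.0))  # (4.0, -3.0)
```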

The radix-2^2 algorithm has the same multiplicative
complexity as the radix-4 algorithm, but retains the
radix-2 butterfly structure. The multiplicative
operations are so arranged that only every other stage
has non-trivial multiplications. This is a substantial
structural advantage over other algorithms when
pipeline/cascade FFT architectures are considered.
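
The stage arrangement can be sketched as an in-place flow graph in which each pass performs BF I, the trivial -j rotation, BF II, and then one column of non-trivial twiddles. This is an illustrative software model under the assumption that N is a power of 4, not a description of the hardware; the names `fft_radix22_dif` and `bit_reverse` are hypothetical.

```python
import cmath


def fft_radix22_dif(x):
    """Radix-2^2 DIF FFT of a power-of-4 length; output is in bit-reversed order."""
    a = [complex(v) for v in x]
    n = len(a)
    log2n = n.bit_length() - 1
    if n < 1 or (1 << log2n) != n or log2n % 2:
        raise ValueError("length must be a power of 4")
    size = n
    while size >= 4:
        half, quarter = size // 2, size // 4
        for base in range(0, n, size):
            for i in range(half):                      # BF I, distance size/2
                u, v = a[base + i], a[base + i + half]
                a[base + i], a[base + i + half] = u + v, u - v
            for i in range(quarter):                   # trivial factor -j
                a[base + half + quarter + i] *= -1j
            for start in (base, base + half):          # BF II, distance size/4
                for i in range(quarter):
                    u, v = a[start + i], a[start + i + quarter]
                    a[start + i], a[start + i + quarter] = u + v, u - v
            if quarter > 1:                            # non-trivial twiddle column
                for block, mult in enumerate((0, 2, 1, 3)):
                    for n3 in range(quarter):
                        w = cmath.exp(-2j * cmath.pi * n3 * mult / size)
                        a[base + block * quarter + n3] *= w
        size = quarter
    return a


def bit_reverse(p, bits):
    """Reverse the lowest `bits` bits of p."""
    return int(format(p, "0{}b".format(bits))[::-1], 2)
```

In this model the position p of the returned list holds X(bit_reverse(p, log2 N)), and the twiddle column is skipped in the last pass, so a length-N transform has log_4 N - 1 columns of non-trivial multiplications.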

By applying the radix-2^2 DIF FFT algorithm, derived
above, to a R2SDF architecture, the new and efficient
radix-2^2 SDF architecture of the present invention is
obtained. This architecture requires a minimum of
hardware resource, compared with the known architectures
discussed in the introduction to this specification,
because of the reduced multiplicative complexity and the
preservation of spatial regularity for both additive and
multiplicative operations in the SFG, as shown in
Figures 7 and 8.

Figure 9 illustrates the architecture of a
real-time pipeline FFT processor, according to the present
invention, for N = 256. The similarity between the data
path in this processor and the R2SDF architecture and
the reduced number of multipliers should be noted.

Referring now to Figure 9, the input data sequence
is passed to the first, 9, of a pair of butterfly units, 9
and 10. A one hundred and twenty eight word feedback
register, 1, links the output of butterfly 9 to its
input. The second butterfly unit 10 has a sixty four
word feedback register 2. Multiplier 17 links the first
stage of the processor, comprising butterfly units 9 and
10, to the second stage of the processor, comprising
butterfly units 11 and 12, and multiplies the data
stream by the twiddle factor W1(n). It should be noted
at this point that the structure of butterfly units 9,
11, 13 and 15 differs from that of butterfly units 10,
12, 14 and 16, see below. Butterfly units 11 and 12 are
provided with feedback registers 3 and 4 having a thirty
two word and sixteen word capacity respectively. A
multiplier 17, located between the second and third
stages of the processor, multiplies the data stream by
twiddle factor W2(n). The third stage of the processor
comprises butterflies 13 and 14, together with eight
word feedback register 5 and four word feedback register
6. Again a multiplier 17, located between the third and
fourth stages of the processor, multiplies the data
stream by twiddle factor W3(n). The fourth stage of the
processor comprises butterfly units 15 and 16 together
with two word feedback register 7 and one word feedback
register 8. The output sequence, X(k), is derived from
the output of the fourth stage of the processor. The
binary counter 18 is clocked by a clock signal 19. The
binary counter 18 acts as a synchronisation controller
and address counter for the twiddle factors used between
each stage of the processor.

The two types of butterfly used in the processor
are BF2I and BF2II. The BF2I butterfly is similar to
the R2SDF butterfly. The BF2II butterfly contains the
logic needed to implement the trivial twiddle factor
multiplications. Because of the spatial regularity of
the radix-2^2 algorithm, the synchronisation control of
the processor is particularly simple and is, as
described above, implemented with a (log_2 N)-bit counter.

The type BF2I butterfly is illustrated in Figure
10. The butterfly unit comprises two adders, 21, two
subtractors, 22, and four multiplexers, 23, connected as
shown in Figure 10. Operation of the multiplexers is
controlled by control signal 27, as will be explained
later.

WO 97/19412 PCT/SE96/0O246

The type BF2II butterfly is illustrated in Figure
11. It is similar in construction to the type BF2I
butterfly, but with the addition of a 2x2 commutator,
26, and a logic gate, 24. The logic gate 24 is an AND
gate with one inverted input. Control signal 25 is
applied to the inverted input of AND gate 24, and
control signal 27, which is also applied to the
multiplexers 23, is applied to the non-inverted input of
AND gate 24. The output from AND gate 24 drives
commutator 26.
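
The gating just described reduces to a single Boolean expression, sketched below. The function name `commutator_drive` is hypothetical, and the sketch only models the gate output; which level of that output makes commutator 26 swap its inputs is not stated in this passage and is left open.

```python
def commutator_drive(signal_25, signal_27):
    """Output of AND gate 24.

    Control signal 25 enters through the inverted input and control
    signal 27 through the non-inverted input.
    """
    return (not signal_25) and signal_27


for s25 in (False, True):
    for s27 in (False, True):
        print(s25, s27, commutator_drive(s25, s27))
```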

The scheduled operation of the R2^2SDF processor is
as follows. On the first N/2 cycles, the 2-to-1
multiplexers, 23, in the first butterfly module switch
to position "0", and the butterfly is idle. The input
data from the left is directed to the shift registers
until they are filled. On the next N/2 cycles, the
multiplexers, 23, turn to position "1", and the butterfly
unit computes a 2-point DFT with the incoming data and
the data stored in the shift registers.

$$
\begin{aligned}
Z1(n) &= x(n) + x(n + N/2) \\
Z1(n + N/2) &= x(n) - x(n + N/2)
\end{aligned}
\qquad 0 \le n < N/2 \qquad (7)
$$



The butterfly output Z1(n) is sent to apply the
twiddle factor, and Z1(n+N/2) is sent back to the shift
registers to be "multiplied" in the next N/2 cycles, when
the first half of the next frame of the time sequence is
loaded. The operation of the second butterfly is
similar to that of the first one, except that the
"distance" of the butterfly input sequence is just N/4
and the trivial twiddle factor multiplication is
implemented by real-imaginary swapping by commutator 26
and controlled add/subtract operations. This requires a
two bit control signal, 25 and 27, from the synchronising
counter, 18. The data then passes through a full
complex multiplier, 17, working at 75% utility, to
produce the results of the first level of radix-4 DFT
word by word. Further processing repeats this pattern,
with the distance of the input data decreasing by half
at each consecutive butterfly stage. After N-1 clock
cycles, the complete DFT transform result streams out to
the right of the processor, see Figure 9, in bit-reversed
order. The next frame of the transform can then be
processed without pausing, because of the pipelined
processing at each stage of the processor.
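
The data flow described above can be checked with a functional model. It is not cycle-accurate, and it is a sketch rather than the processor of Figure 9. It assumes the twiddle factor convention W_N = exp(-j2π/N), and the names `r22_fft_bitrev`, `bit_reverse` and `dft` are hypothetical. Each recursion level performs a first butterfly at distance N/2, the trivial -j multiplication, a second butterfly at distance N/4 and one full complex multiplication. The result is then compared with a directly computed DFT.

```python
import cmath


def r22_fft_bitrev(x):
    """Radix-2^2 decimation-in-frequency DFT; result is in bit-reversed order."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    if n == 2:                      # extra radix-2 stage if N is not a power of 4
        return [x[0] + x[1], x[0] - x[1]]
    q = n // 4
    # First butterfly: 2-point DFTs at distance N/2, as in equation (7).
    sums = [x[i] + x[i + 2 * q] for i in range(2 * q)]
    diffs = [x[i] - x[i + 2 * q] for i in range(2 * q)]
    # Trivial twiddle -j on the last quarter: swap real/imaginary, negate one.
    diffs = diffs[:q] + [complex(d.imag, -d.real) for d in diffs[q:]]
    out = []
    for k1, half in ((0, sums), (1, diffs)):
        for k2 in (0, 1):
            sign = -1 if k2 else 1
            # Second butterfly: 2-point DFTs at distance N/4.
            h = [half[i] + sign * half[i + q] for i in range(q)]
            # Full complex multiplier: W_N^(i*(k1 + 2*k2)), unity for k1 = k2 = 0.
            w = [h[i] * cmath.exp(-2j * cmath.pi * i * (k1 + 2 * k2) / n)
                 for i in range(q)]
            out.extend(r22_fft_bitrev(w))
    return out


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def dft(x):
    """Direct O(N^2) DFT used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


N = 16
x = [complex(i % 5, -(i % 3)) for i in range(N)]
streamed = r22_fft_bitrev(x)
reference = dft(x)
bits = N.bit_length() - 1
max_err = max(abs(streamed[i] - reference[bit_reverse(i, bits)]) for i in range(N))
print(max_err)
```

In this model the block with k1 = k2 = 0 is multiplied by unity only, so one quarter of the words passing the multiplier need no real multiplication.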

In a practical implementation of the radix-2² SDF
processor, pipeline registers should be inserted between
each multiplier and butterfly stage to improve
performance. Shimming registers are also needed so that
the control signals comply with the revised timing. The
latency of the output is then increased to
$N - 1 + 3(\log_4 N - 1)$ without affecting the
throughput rate.
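
As a worked example of the latency expression, the short sketch below evaluates it for a given transform length. The helper names are hypothetical, and N is assumed to be a power of 4 so that log₄N is an integer.

```python
def log4(n: int) -> int:
    """Integer base-4 logarithm; n is assumed to be a power of 4 (n >= 4)."""
    if n < 4 or n & (n - 1) or (n.bit_length() - 1) % 2:
        raise ValueError("N must be a power of 4 and at least 4")
    return (n.bit_length() - 1) // 2


def pipelined_latency(n: int) -> int:
    """Output latency in clock cycles: N - 1 + 3 * (log4(N) - 1)."""
    return n - 1 + 3 * (log4(n) - 1)


print(pipelined_latency(256))
```

For N = 256 the added registers contribute 9 cycles on top of the 255 cycles of the unpipelined schedule.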

The hardware requirements of the radix-2² SDF processor
architecture, as compared with known architectures, are
set out in the table below, which lists the number of
complex multipliers, the number of adders, the memory
size and the control complexity.





           multiplier #         adder #        memory size   control

R2MDC      $2(\log_4 N - 1)$    $4\log_4 N$    $3N/2 - 2$    simple
R2SDF      $2(\log_4 N - 1)$    $4\log_4 N$    $N - 1$       simple
R4SDF      $\log_4 N - 1$       $8\log_4 N$    $N - 1$       medium
R4MDC      [illegible]          $8\log_4 N$    $5N/2 - 4$    simple
R4SDC      $\log_4 N - 1$       $3\log_4 N$    $2N - 2$      complex
R2²SDF     $\log_4 N - 1$       $4\log_4 N$    $N - 1$       simple
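
The expressions in the table can be evaluated for a concrete length. The sketch below does so for N = 256. The function names and the dictionary layout are hypothetical, N is assumed to be a power of 4, and R4MDC is left out because its multiplier entry cannot be read from the table. The feedback register lengths assume one register per butterfly, halving from N/2 down to one word.

```python
def log4(n: int) -> int:
    """Integer base-4 logarithm; n is assumed to be a power of 4 (n >= 4)."""
    if n < 4 or n & (n - 1) or (n.bit_length() - 1) % 2:
        raise ValueError("N must be a power of 4 and at least 4")
    return (n.bit_length() - 1) // 2


def requirements(n: int) -> dict:
    """(complex multipliers, adders, memory words) per architecture."""
    s = log4(n)
    return {
        "R2MDC": (2 * (s - 1), 4 * s, 3 * n // 2 - 2),
        "R2SDF": (2 * (s - 1), 4 * s, n - 1),
        "R4SDF": (s - 1, 8 * s, n - 1),
        "R4SDC": (s - 1, 3 * s, 2 * n - 2),
        "R22SDF": (s - 1, 4 * s, n - 1),
    }


def feedback_register_lengths(n: int) -> list:
    """Assumed register length per butterfly: N/2, N/4, ..., 1."""
    return [n >> (i + 1) for i in range(2 * log4(n))]


table = requirements(256)
for name, (mult, add, mem) in table.items():
    print(f"{name:7s} multipliers={mult:2d} adders={add:2d} memory={mem}")
print(feedback_register_lengths(256), sum(feedback_register_lengths(256)))
```

The register lengths sum to N - 1, which agrees with the memory size listed for the single-path delay-feedback architectures.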



The table shows that the R2²SDF architecture reaches
the minimum requirement for both multipliers and
storage, and is second only to the R4SDC architecture
with regard to the number of adders. This means that
the R2²SDF architecture is ideal for the VLSI
implementation of pipeline FFT processors.

CLAIMS

1. A real-time pipeline fast Fourier transform
processor, characterised in that said processor includes
a plurality of paired first and second butterfly means,
each of said first butterfly means and each of said
second butterfly means having a feedback path between an
output therefrom to an input thereto, in that each of
said paired butterfly means is linked by a multiplier to
an adjacent one of said plurality of paired first and
second butterfly means, in that an input data sequence
is applied to an input of a first one of said plurality
of paired first and second butterfly means, and in that
an output data sequence is derived from a last one of
said plurality of paired first and second butterfly
means.

2. A real-time pipeline fast Fourier transform
processor as claimed in claim 1, characterised in that
said processor is realised on a VLSI chip.



3. A real-time pipeline fast Fourier transform
processor as claimed in either claim 1 or claim 2,
characterised in that said processor operates on a
radix-2² algorithm having the same multiplicative
complexity as a radix-4 algorithm but employing radix-2
butterflies.

4. A real-time pipeline fast Fourier transform
processor as claimed in any previous claim,
characterised in that only a single data path exists
between each butterfly means.

5. A real-time pipeline fast Fourier transform
processor as claimed in any previous claim,
characterised in that said first butterfly means are
radix-2 single delay feedback butterflies, and in that
said second butterfly means are radix-2 single delay
feedback butterflies including logic circuitry to
implement trivial twiddle factor multiplications.



6. A real-time pipeline fast Fourier transform
processor as claimed in any previous claim,
characterised in that said processor includes a
synchronisation control means and an address means for
twiddle factor reading for each processor stage.

7. A real-time pipeline fast Fourier transform
processor as claimed in either claim 5 or 6,
characterised in that said first butterfly means
includes two adders, two subtractors, and four 2-to-1
multiplexers.



8. A real-time pipeline fast Fourier transform
processor as claimed in any of claims 5 to 7,
characterised in that said second butterfly means
includes at least one adder, at least one subtractor, at
least two 2-to-1 multiplexers, a 2x2 commutator, and an
AND gate with one inverting input and one non-inverting
input.



9. A real-time pipeline fast Fourier transform
processor as claimed in claim 8, when appended to claim
6 or 7, characterised in that control signals derived
from said synchronisation control means are applied to
the inputs to said AND gate.

10. A real-time pipeline fast Fourier transform
processor as claimed in any of claims 6 to 9,
characterised in that said synchronisation control means
and said address means are implemented as a single
binary counter.

11. A real-time pipeline fast Fourier transform
processor as claimed in any previous claim,
characterised in that a pipeline register is located
between each multiplier and a following butterfly means.

12. A real-time pipeline fast Fourier transform
processor as claimed in claim 11, characterised in that
said processor includes shimming registers for adjusting
control signal timing.

13. A real-time pipeline fast Fourier transform
processor as claimed in any previous claim,
characterised in that for a digital Fourier transform of
length N, said processor includes no more than
$\log_4 N - 1$ multipliers, $4\log_4 N$ adders, and a
memory size of $N - 1$.

14. A real-time pipeline fast Fourier transform
processor as claimed in any previous claim,
characterised in that said processor is arranged to
handle a 256 point FFT, in that said processor has four
processing stages in said pipeline, each processing
stage being separated by a multiplier, and in that each
processing stage comprises a first butterfly means with
a feedback register, and a second butterfly means with
a feedback register.

15. A real-time pipeline fast Fourier transform
processor as claimed in claim 14, characterised in that
said first butterfly means in said first stage has a one
hundred and twenty eight word feedback register, said
second butterfly means in said first stage has a sixty
four word feedback register, said first butterfly means
in said second stage has a thirty two word feedback
register, said second butterfly means in said second
stage has a sixteen word feedback register, said first
butterfly means in said third stage has an eight word
feedback register, said second butterfly means in said
third stage has a four word feedback register, said
first butterfly means in said fourth stage has a two
word feedback register and said second butterfly means
in said fourth stage has a one word feedback register.

16. A real-time pipeline fast Fourier transform
processor, characterised in that said processor operates
on a radix-2² algorithm having the same multiplicative
complexity as a radix-4 algorithm but employing radix-2
butterflies, and in that for a digital Fourier transform
of length N, said processor includes no more than
$\log_4 N - 1$ multipliers, $4\log_4 N$ adders, and a
memory size of $N - 1$.